Currently, Zillow does not factor enough local intelligence into its housing market predictions, which makes them less accurate than they could be. This project aims to create a linear regression model that better predicts home sale prices in Boston. The model provides estimated elasticities (multipliers) for a set of variables, showing the magnitude of their influence on home sale prices, which can be used to predict home prices in locations where price data are not available. Ideally, the model should produce predictions that not only achieve the best “average accuracy” but are also relatively consistent in accuracy across locations. In other words, the goal is a model that both fits the sample data well overall and generalizes to data that we have not seen yet.
Striking a balance between these two is very important for a predictive model, since its value lies not in best explaining the relationships of interest where we have data, but in using existing data to approximate those relationships in places where we do not. In reality, home prices in different locations are influenced by place-specific characteristics that a single model cannot capture. Balancing overall goodness of fit against generalizability maximizes the capture of patterns that are valid across space, namely those that hold both in places we have data for and in places we do not. This is what gives the model strong predictive power.
The project is challenging for this reason: as we make the model better at predicting the home prices we have data for, we also run the risk of overfitting and making it less effective at predicting home prices we do not have data for. Furthermore, home prices are not only influenced by spatial features independently; they are also known to exhibit positive spatial autocorrelation. That is to say, home prices at nearby locations tend to be similar because they also influence each other. For the model to succeed, we should also identify variables that can account for the spatial autocorrelation between home prices.
The first step in developing the model is to collect data for a set of candidate variables which, based on theories of housing prices, can be grouped into three categories:
1) Internal housing characteristics
2) Amenities, public services, and socio-economic environment
3) Home prices nearby
To select the variables for the regression, we first do exploratory analysis, using scatterplots, exploratory statistics, and GIS overlay mapping to identify the most effective variables. We then run the linear regression iteratively, adding or deleting variables based on the changes in goodness of fit and generalizability, until we obtain a model that is satisfactory on both criteria.
Our final model explains 87.7% of the variation in home prices in our dataset (adjusted R-squared). The model generalizes across locations, although it performs slightly better in poor and middle-income neighborhoods. The final model includes 27 significant variables, of which the most influential are the neighborhood fixed effects, whether the home is in a special zoning district (restricted parking/neighborhood design/Planned Development Area), road density in its census tract, whether it has AC, and the vacancy rate of its census tract.
Apart from the internal housing characteristics attached to the sales record of each house, we gathered amenity, public service, socio-economic environment, and neighboring home sales variables from the ACS and Boston Open Data through a series of spatial joins and calculations in both R and ArcGIS. The socio-economic environment variables for each house are set to the values of the census tract in which it is located. For consistency, we used the 2011-2015 ACS as the current year and compared it with the 2006-2010 ACS to calculate changes and percent changes. The measures of public services and amenities are distances, or counts within a specific distance, generated by spatial joins. To improve prediction, we also regrouped some of the housing characteristics, for instance exterior finish, property type, and sale season.
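The distance and count measures described above can be illustrated with a short sketch. It is not the R/ArcGIS workflow used in the project: it assumes straight-line great-circle distances, and the coordinates and the helper names (`haversine_miles`, `count_within`, `nearest_distance`) are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def count_within(house, points, radius_miles):
    """Number of points within radius_miles of the house (a 'count' variable)."""
    return sum(1 for p in points if haversine_miles(*house, *p) <= radius_miles)

def nearest_distance(house, points):
    """Distance in miles from the house to the closest point (a 'distance' variable)."""
    return min(haversine_miles(*house, *p) for p in points)

# Made-up coordinates, for illustration only.
house = (42.3601, -71.0589)
restaurants = [(42.3610, -71.0570), (42.3555, -71.0640), (42.4000, -71.1200)]

restaurants_within_1_mile = count_within(house, restaurants, 1.0)
distance_to_nearest = nearest_distance(house, restaurants)
print(restaurants_within_1_mile, round(distance_to_nearest, 3))
```

A GIS spatial join performs the same lookup at scale, typically with a spatial index instead of the brute-force loop used here.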
In total, we included 27 variables to predict home prices. Seven of them are categorical, and the rest are numeric. The categorical variables are Property Type, Residential Exemption (Y/N), Building Style, Exterior Finish Structure, Air Conditioning Type, Off-Season Sales (Y/N), and Spatial Zoning District (Y/N).
The rest of the variables relate to the land area, floors, rooms, and fittings of the house itself, its distance to different services, its accessibility, the demographic and economic characteristics of its location, and the sale prices of neighboring buildings. Summary statistics for these continuous variables and for the dependent variable are shown below. The variable names are self-explanatory.
| | Minimum | Maximum | 1st Quartile | 3rd Quartile | Mean | Median |
| --- | --- | --- | --- | --- | --- | --- |
| Sale Price | 200,000 | 11,600,000 | 415,000 | 650,000 | 642,767.900 | 519,500 |
| log(Sale Price) | 12.206 | 16.267 | 12.936 | 13.385 | 13.229 | 13.161 |
| Parcel’sLandArea | 498 | 63,941 | 2,519 | 5,650 | 4,553.962 | 4,222 |
| LatestRemodeledYear | 0 | 2013 | 0 | 1998 | 755.185 | 0 |
| TotalLivableArea | 573 | 9,908 | 1,469 | 2,922 | 2,280.976 | 2,100 |
| Num.OfFloors | 1 | 4 | 2 | 3 | 2.181 | 2 |
| Num.OfBedrooms | 1 | 14 | 3 | 6 | 4.497 | 4 |
| Num.OfFullBath | 1 | 8 | 1 | 3 | 1.995 | 2 |
| Num.OfHalfBath | 0 | 3 | 0 | 1 | 0.357 | 0 |
| Num.OfFireplace | 0 | 5 | 0 | 1 | 0.405 | 0 |
| 2015CT MedIncome | 24,286 | 121,096 | 30,943 | 93,819 | 52,434.810 | 30,943 |
| CT Road_Density | 0.010 | 5.011 | 0.035 | 0.103 | 0.087 | 0.066 |
| Dist_Non-publicSchool | 371.547 | 2,681.292 | 888.087 | 1,436.417 | 1,178.495 | 1,118.950 |
| 2015CT EmploymentDensity | 272.487 | 197,897.800 | 881.841 | 3,375.256 | 4,034.451 | 2,029.477 |
| 2015CT ShareOfBachelor | 0.054 | 0.951 | 0.214 | 0.586 | 0.401 | 0.369 |
| 2015CT MedHomeValue | 0 | 1,019,700 | 315,800 | 416,500 | 383,850.000 | 353,000 |
| 2015CT VacancyRate | 0 | 0.237 | 0.040 | 0.086 | 0.065 | 0.061 |
| 2010-2015CT ShareOfBachelor_Chg | -0.351 | 1.810 | 0.027 | 0.531 | 0.301 | 0.223 |
| 2010-2015CT MedGrossRent_Chg | -467 | 2,355 | 11 | 202 | 132.002 | 105 |
| Num.OfCrimes Within0.5mile | 83 | 13,165 | 1,150 | 5,193 | 3,417.238 | 2,600 |
| Num.OfRestaurant Within1mile | 4 | 778 | 28 | 74 | 69.733 | 46 |
| Avg.SalesPrice Within0.25mile | 269,666.700 | 11,600,000 | 451,000 | 644,786 | 644,814.700 | 517,285.700 |
To obtain a parsimonious yet effective set of predictors, we want each predictor to explain some of the variance in the dependent variable without being linearly explained by the other independent variables. Therefore, we checked the correlations among the dependent variable and all the independent variables. The correlation matrix is shown below. The criteria we use to select independent variables are their correlation with the dependent variable and their correlation with the other independent variables.
Correlation Matrix of Continuous Variables and Dependent Variable
These independent variables are correlated with both sales price and the log of sales price, but they are not strongly correlated with each other.
The map below illustrates how sale prices are distributed spatially, which is one basis for finding predictors. It also suggests possible spatial autocorrelation, since high prices tend to cluster together, as do low prices.
According to the histograms below, sales prices are not normally distributed, which is problematic for OLS regression. Therefore, we use the log-transformed sale price, which is approximately normally distributed, as the dependent variable.
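The screening logic described above (keep predictors that correlate with the dependent variable, avoid pairs of predictors that correlate strongly with each other) can be sketched as follows. The toy data, the 0.9 cut-off, and the helper names `pearson` and `collinear_pairs` are assumptions for illustration; the report does not state a numeric threshold.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(predictors, threshold):
    """Pairs of predictors whose absolute correlation exceeds the threshold."""
    names = sorted(predictors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(predictors[a], predictors[b])) > threshold
    ]

# Made-up values for six houses.
log_price = [12.2, 12.6, 12.9, 13.1, 13.5, 13.9]
predictors = {
    "living_area": [900, 1400, 1800, 2100, 2600, 3300],
    "bedrooms": [2, 3, 3, 4, 5, 6],
    "fireplaces": [0, 1, 0, 0, 1, 0],
}

# Correlation of each predictor with the dependent variable.
with_target = {name: pearson(values, log_price) for name, values in predictors.items()}
# Predictor pairs that are candidates for dropping one member.
flagged = collinear_pairs(predictors, threshold=0.9)
print(with_target)
print(flagged)
```

In this toy data, living area and bedrooms carry nearly the same information, so a screen of this kind would suggest keeping only one of them unless both remain significant in the regression.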
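As a quick check on the transformation, the `log(Sale Price)` row of the summary table is consistent with a natural logarithm of the `Sale Price` row. The sketch below uses only the minimum and maximum reported in that table; the helper `to_price` is a hypothetical name for the back-transform.

```python
import math

# Minimum and maximum sale price as reported in the summary table.
min_price, max_price = 200_000, 11_600_000

log_min = math.log(min_price)  # natural log
log_max = math.log(max_price)
print(round(log_min, 3), round(log_max, 3))  # 12.206 16.267

def to_price(log_price):
    """Back-transform a value on the log scale to dollars."""
    return math.exp(log_price)

# The log compresses the long right tail: the maximum is 58 times the
# minimum in dollars, but only about 4 units away on the log scale.
price_ratio = max_price / min_price
log_gap = log_max - log_min
```

Predictions made on the log scale have to be passed through `to_price` (or an equivalent) before dollar-based error measures such as MAPE are computed.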
The justifications for some of the independent variables are shown below.
Compared with the previous sales price map, houses close to more restaurants are more expensive. People are willing to pay more for better amenities.
According to the map above, houses close to non-public schools have higher values. People are willing to pay more for better school quality.
Sales Price by Neighborhood
According to the map above, houses located in the western and northern neighborhoods are generally more expensive. Similar housing prices tend to cluster together.
The scatterplot above illustrates that housing prices are correlated with road density. Houses with better connectivity or accessibility are generally more expensive.
According to the boxplot above, housing prices vary with exterior finish.
As mentioned before, the final model was selected by iteratively feeding different variables into the regression and observing the changes in overall goodness of fit and generalizability.
Overall goodness of fit is judged by the adjusted R-squared of the final model, that is, the share of the variation in home prices captured by the model, adjusted for the number of predictors and calculated on the sample data.
To get a sense of the model’s generalizability, 75% of the sample is randomly selected as a training dataset used to train the model, that is, to calculate the coefficient of each selected variable, while the remaining 25% serves as a test dataset to validate the model. The coefficients estimated on the training dataset are applied to the test dataset to calculate predicted home sale prices. By comparing the actual home prices in the test dataset with the predicted values, we can judge whether the model is generalizable. The goal is to minimize the absolute percent difference between predicted and observed values in both the training and the test dataset. A Moran’s I test, which determines whether nearby values are significantly correlated, is conducted on the prediction errors, so that we know whether our model systematically overpredicts or underpredicts home prices at locations close to each other. A good model should produce a random spatial pattern in the errors and a Moran’s I p-value larger than 0.05.
Still, the generalizability indicated by a single random holdout validation could simply be luck: the random test set we drew may just happen to be very similar to the training set. To check generalizability more robustly, we use ten-fold cross-validation, which splits our sample data into ten random sub-samples and conducts ten holdout validations, each holding out one sub-sample. A generalizable model would have a similar R-squared across the ten validations.
Also, to examine whether our model works well in different neighborhoods, the MAPE (mean absolute percent error of predictions) is calculated for each neighborhood in Boston and visualized on a map. The desired outcome is a MAPE that is relatively similar across neighborhoods. Furthermore, spatial cross-validation is conducted three times, each time holding out a subsample of home prices from rich, poor, or middle-income neighborhoods. Again, a generalizable model would have a similar MAPE in the three validations.
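The 75/25 holdout procedure can be sketched as below. This is a simplified stand-in rather than the project’s model: it fits a one-predictor regression of log price on livable area to synthetic data, and the data-generating numbers, the seed, and the helper names (`fit_simple_ols`, `mape`, `holdout_mape`) are all assumptions.

```python
import math
import random

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mape(observed, predicted):
    """Mean absolute percent error, as a fraction (0.12 means 12%)."""
    return sum(abs(p - o) / o for o, p in zip(observed, predicted)) / len(observed)

# Synthetic sample: log price rises with livable area, plus noise.
rng = random.Random(42)
areas = [rng.uniform(600, 5000) for _ in range(200)]
log_prices = [12.5 + 0.0003 * a + rng.gauss(0, 0.15) for a in areas]

# Random 75% / 25% split into training and test indices.
indices = list(range(len(areas)))
rng.shuffle(indices)
cut = int(0.75 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Coefficients are estimated on the training set only.
intercept, slope = fit_simple_ols(
    [areas[i] for i in train_idx], [log_prices[i] for i in train_idx]
)

def holdout_mape(idx):
    """MAPE in dollars for the rows in idx, using the training coefficients."""
    observed = [math.exp(log_prices[i]) for i in idx]
    predicted = [math.exp(intercept + slope * areas[i]) for i in idx]
    return mape(observed, predicted)

train_mape = holdout_mape(train_idx)
test_mape = holdout_mape(test_idx)
print(f"train MAPE = {train_mape:.3f}, test MAPE = {test_mape:.3f}")
```

A test MAPE close to the training MAPE is the pattern that the holdout check looks for; a much larger test MAPE would point to overfitting.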
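A minimal sketch of ten-fold cross-validation follows, again on synthetic data with a one-predictor model standing in for the full regression. The fold assignment scheme, the seeds, and the helper names (`k_fold_indices`, `fit_simple_ols`, `r_squared`) are assumptions, not the project’s code.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and return k (train, test) index pairs that partition the data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    pairs = []
    for fold in folds:
        held_out = set(fold)
        pairs.append(([j for j in idx if j not in held_out], fold))
    return pairs

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Synthetic sample: log price as a noisy linear function of livable area.
rng = random.Random(1)
x = [rng.uniform(600, 5000) for _ in range(300)]
y = [12.5 + 0.0003 * a + rng.gauss(0, 0.15) for a in x]

scores = []
for train, test in k_fold_indices(len(x), 10):
    b0, b1 = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
    scores.append(r_squared([y[i] for i in test], [b0 + b1 * x[i] for i in test]))

mean_r2 = statistics.mean(scores)
sd_r2 = statistics.stdev(scores)
print(f"R-squared across folds: mean = {mean_r2:.3f}, sd = {sd_r2:.3f}")
```

A small standard deviation relative to the mean indicates that performance does not depend much on which tenth of the data is held out.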
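Grouping the percent errors by neighborhood is a simple aggregation. The sketch below uses four made-up sales; the neighborhood labels are only examples and `mape_by_group` is a hypothetical helper name.

```python
from collections import defaultdict

def mape_by_group(records):
    """Mean absolute percent error per group.

    records: iterable of (group, observed_price, predicted_price).
    """
    errors = defaultdict(list)
    for group, observed, predicted in records:
        errors[group].append(abs(predicted - observed) / observed)
    return {group: sum(vals) / len(vals) for group, vals in errors.items()}

# Made-up observed and predicted prices.
sales = [
    ("Roxbury", 400_000, 440_000),       # 10% error
    ("Roxbury", 500_000, 450_000),       # 10% error
    ("Back Bay", 2_000_000, 1_600_000),  # 20% error
    ("Back Bay", 1_000_000, 1_100_000),  # 10% error
]

neighborhood_mape = mape_by_group(sales)
print(neighborhood_mape)
```

Joining the resulting dictionary to neighborhood polygons gives the kind of MAPE map described above.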
Based on the training set, our final model is
| | Dependent variable: Log(Sales Price) |
| --- | --- |
| Property Type 104 | 0.037** (0.016) |
| Property Type 105 | 0.069*** (0.025) |
| With Residential Exemption | 0.024** (0.010) |
| Parcel’s Land Area | 0.00001*** (0.00000) |
| Latest Remodeled Year | 0.00002*** (0.00000) |
| Total Livable Area | 0.0001*** (0.00001) |
| Num Of Floors | 0.057*** (0.012) |
| Num Of Bedrooms | 0.013*** (0.004) |
| Exterior Finish Structure M | -0.073*** (0.014) |
| Exterior Finish Structure W | -0.072*** (0.016) |
| Num Of Full Bath | 0.034*** (0.010) |
| Num Of Half Bath | 0.026*** (0.010) |
| Type of Air Conditioning N | -0.068*** (0.015) |
| Num Of Fireplace | 0.042*** (0.008) |
| 2015CT Median Income | 0.00000** (0.00000) |
| CT Road Density | -0.248* (0.127) |
| Distance to Non-publicSchool | -0.0001*** (0.00002) |
| 2015CT Employment Density | -0.00000** (0.00000) |
| 2015CT Share Of Bachelor Degree | 0.492*** (0.058) |
| 2015CT Median Home Value | 0.00000** (0.00000) |
| 2015CT Vacancy Rate | -0.475** (0.211) |
| 2010-2015CT Share Of Bachelor Degree Change | -0.040*** (0.015) |
| 2010-2015CT Median Gross Rent Change | -0.00003* (0.00002) |
| Num Of Crimes Within 0.5 mile | -0.00003*** (0.00000) |
| Num Of Restaurant Within 1 mile | 0.001*** (0.0002) |
| Off Season Sales | -0.048*** (0.012) |
| Spatial Zoning District | 0.050*** (0.015) |
| Neighborhood Back Bay | -0.882*** (0.292) |
| Neighborhood Bay Village | 1.384** (0.634) |
| Neighborhood Beacon Hill | -0.123 (0.164) |
| Neighborhood Brighton | -0.223** (0.098) |
| Neighborhood Charlestown | -0.121 (0.090) |
| Neighborhood Dorchester | -0.312*** (0.078) |
| Neighborhood Downtown | 0.041 (0.242) |
| Neighborhood East Boston | -0.411*** (0.095) |
| Neighborhood Fenway | 0.220 (0.149) |
| Neighborhood Hyde Park | -0.451*** (0.080) |
| Neighborhood Jamaica Plain | -0.195** (0.077) |
| Neighborhood Mattapan | -0.364*** (0.079) |
| Neighborhood Mission Hill | -0.063 (0.098) |
| Neighborhood Roslindale | -0.345*** (0.078) |
| Neighborhood Roxbury | -0.366*** (0.082) |
| Neighborhood South Boston | -0.216** (0.090) |
| Neighborhood South End | 0.146 (0.129) |
| Neighborhood West Roxbury | -0.326*** (0.079) |
| Average Sales Price Within 0.25 mile | 0.00000*** (0.00000) |
| Constant | 12.756*** (0.102) |
| Observations | 1,313 |
| R2 | 0.881 |
| Adjusted R2 | 0.877 |
| Residual Std. Error | 0.162 (df = 1266) |
| F Statistic | 203.590*** (df = 46; 1266) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
The R-squared, root mean square error, mean absolute error, and MAPE for the training set are:
| | Rsquared | RootMeanSquareError | MeanAbsError | MeanAbsPercentError |
| --- | --- | --- | --- | --- |
| 1 | 0.881 | 382,326.000 | 91,715.490 | 0.121 |
The R-squared, root mean square error, mean absolute error, and MAPE for the test set are:
| | Rsquared | RootMeanSquareError | MeanAbsError | MeanAbsPercentError |
| --- | --- | --- | --- | --- |
| 1 | 0.926 | 115,458.800 | 75,259.870 | 0.126 |
The comparison of errors between the training set and the test set suggests that our model is generalizable: the MAPE is similar in the two sets (12.1% and 12.6%).
However, since we draw only one random test set, the result could still be due to chance. Therefore, we use cross-validation to repeat the holdout 10 times and check the mean and standard deviation of R-squared and MAE.
| | RootMeanSquareError | Rsquared | MeanAbsError |
| --- | --- | --- | --- |
| Mean | 0.217 | 0.805 | 0.135 |
| StandardDeviation | 0.097 | 0.145 | 0.017 |
According to the two histograms above, we conclude that the model is generalizable overall. R-squared clusters at high values and MAE clusters at low values, although some variance exists.
We also plot the residuals of the 25% randomly selected test set as a function of the observed and predicted values respectively. The residuals stay almost constant as the predicted sale price increases, suggesting a good fit. However, the residuals increase with the observed sale price, suggesting that our model omits some variables that would explain this variance in the test set. Still, the amount is small compared with the magnitude of our sale prices.
We also conducted a Moran’s I test on the test set residuals to check for spatial autocorrelation in our model. According to the test, there is no significant spatial autocorrelation in the test set residuals.
| | Standard deviate | p.value |
| --- | --- | --- |
| Moran’s I statistic | -2.341 | 0.990 |
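To make the statistic concrete, here is a minimal sketch of Moran’s I with a permutation test. It is not the test routine used for the table above: it assumes binary distance-band weights and a one-sided test for positive autocorrelation, and the grid data and the names `morans_i` and `permutation_p_value` are hypothetical.

```python
import math
import random

def morans_i(values, coords, band):
    """Moran's I with binary weights: w_ij = 1 if points i and j are within `band`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= band:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    return (n / weight_sum) * cross / sum(d * d for d in dev)

def permutation_p_value(values, coords, band, n_perm=199, seed=0):
    """One-sided p-value: share of random relabelings with I at least as large."""
    rng = random.Random(seed)
    observed = morans_i(values, coords, band)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, coords, band) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up residuals on a 6 x 6 grid of locations.
coords = [(r, c) for r in range(6) for c in range(6)]
clustered = [1.0 if c < 3 else -1.0 for r, c in coords]                # like next to like
checkerboard = [1.0 if (r + c) % 2 == 0 else -1.0 for r, c in coords]  # like next to unlike

i_clustered = morans_i(clustered, coords, band=1.0)        # 0.8, strongly positive
i_checkerboard = morans_i(checkerboard, coords, band=1.0)  # -1.0, strongly negative
p_clustered = permutation_p_value(clustered, coords, band=1.0)
print(round(i_clustered, 3), round(i_checkerboard, 3), p_clustered)
```

Under a one-sided test of this kind, a negative statistic yields a p-value close to 1, which matches the pattern of a negative value paired with a p-value of 0.990 in the table above, assuming that test was also one-sided.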
For a better illustration, we map the residuals for the training set and group the mean absolute percent error by neighborhood.
According to these maps, the residuals are generally randomly distributed.
As a further spatial check, we hold out a rich, a middle-income, and a poor neighborhood in turn, fit the regression on the remaining data, and test it on the held-out set.
| | RootMeanSquareError | MAPE |
| --- | --- | --- |
| holdOutRich | 486,488.400 | 0.138 |
| holdOutPoor | 652,038.400 | 0.089 |
| holdOutMiddle | 481,910.300 | 0.108 |
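The hold-one-group-out procedure behind this table can be sketched as follows. The three groups, their baseline price levels, and the one-predictor model are synthetic assumptions chosen only to show the mechanics; they do not reproduce the figures above.

```python
import math
import random

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mape(observed, predicted):
    """Mean absolute percent error, as a fraction."""
    return sum(abs(p - o) / o for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical baseline log prices for three neighborhood groups.
baselines = {"rich": 13.6, "middle": 13.1, "poor": 12.8}

rng = random.Random(7)
rows = []  # (group, livable area, log price)
for group, base in baselines.items():
    for _ in range(60):
        area = rng.uniform(600, 5000)
        rows.append((group, area, base + 0.0002 * area + rng.gauss(0, 0.1)))

results = {}
for held_out in baselines:
    train = [r for r in rows if r[0] != held_out]
    test = [r for r in rows if r[0] == held_out]
    # Fit on the other groups only, then predict the held-out group.
    b0, b1 = fit_simple_ols([r[1] for r in train], [r[2] for r in train])
    observed = [math.exp(r[2]) for r in test]
    predicted = [math.exp(b0 + b1 * r[1]) for r in test]
    results[held_out] = mape(observed, predicted)

print({group: round(value, 3) for group, value in results.items()})
```

Because the held-out group contributes nothing to the fit, this is a stricter test than a random split: the model has to extrapolate to a kind of neighborhood it has not seen.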
According to the three histograms, the MAPE for all three tests is generally low, but the predictive power of our model varies somewhat across neighborhoods.
Overall, we think this is an effective model. According to the standardized coefficients bar chart, six of the ten most influential variables in the model are neighborhood fixed effects; the other four are whether the home is in a special zoning district (restricted parking/neighborhood design/Planned Development Area), road density in its census tract, whether it has no AC, and the vacancy rate in its census tract. Because the dependent variable is log(Sales Price), a coefficient b corresponds to a percent change in price of exp(b) - 1. For the top five, the coefficients can be interpreted as follows:
1) Neighborhood West Roxbury: all else equal, home prices are 27.82% lower on average for homes located in West Roxbury, relative to the omitted reference neighborhood.
2) Whether in special zoning districts (restricted parking/neighborhood design/Planned Development Area): all else equal, home prices are 5.13% higher on average for homes located in any one of these three special zones.
3) Road density in the census tract: all else equal, home prices are 21.96% lower on average when the number of roads per square mile in the census tract increases by 1.
4) Neighborhood South Boston: all else equal, home prices are 19.43% lower on average for homes located in South Boston, relative to the omitted reference neighborhood.
5) No air conditioning: all else equal, home prices are 6.57% lower on average when the home has no AC.
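Because the model predicts log(Sales Price), the standard way to read a coefficient b as a percent change in price is (exp(b) - 1) × 100, with the sign of b kept inside the exponent. The sketch below applies this to the five coefficients as they appear in the regression table; `percent_effect` is a hypothetical helper name.

```python
import math

def percent_effect(beta):
    """Percent change in price for a one-unit increase in a predictor,
    given its coefficient in a regression of log(price)."""
    return (math.exp(beta) - 1) * 100

# Coefficients copied from the regression table.
coefficients = {
    "Neighborhood West Roxbury": -0.326,
    "Spatial Zoning District": 0.050,
    "CT Road Density": -0.248,
    "Neighborhood South Boston": -0.216,
    "Type of Air Conditioning N": -0.068,
}

effects = {name: round(percent_effect(b), 2) for name, b in coefficients.items()}
for name, effect in effects.items():
    print(f"{name}: {effect:+.2f}%")
```

Note that exp(b) - 1 is not symmetric around zero: a coefficient of -0.326 implies a decrease of about 27.8%, whereas +0.326 would imply an increase of about 38.5%.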
The map of model residuals for the test dataset and the map of MAPE (mean absolute percentage error) by neighborhood both indicate that our model accounts for the spatial variation in prices to a certain extent: there are slightly discernible spatial patterns, but they are not very pronounced. The spatial cross-validation results show that the model predicts home prices slightly better in middle-income and poor neighborhoods than in rich neighborhoods.
There are some other interesting variables that we did not incorporate into the model, for example accessibility to good-quality schools and to highly rated restaurants. As mentioned before, the model is better at predicting home prices in middle-income and poor neighborhoods than in rich neighborhoods. Although our final model includes distance to non-public schools and number of restaurants within 1 mile as two significant variables, these are not specific enough to capture the influence of good amenities on home prices. After all, in many real-world cases it is not high accessibility to amenities in general that makes home prices surge, but high accessibility to really good amenities, such as reputable elementary schools and five-star restaurants. In addition, amenities and other socio-economic variables, such as distance to the nearest college or to the nearest transit station, probably have significant non-linear relationships with home prices, as the scatterplots between them and home prices also suggest. In other words, home prices rise steeply once the distance to the nearest college or transit station falls below a certain threshold. In a linear regression model, however, these variables were dropped because they are not significantly correlated with home prices in a linear way. This means that the model misses the most powerful factors for explaining the very highest home prices, which is another important reason why it performs relatively poorly in rich neighborhoods.
Even though the model is not perfect, we would still recommend it to Zillow. First of all, the model has reasonably good fit and generalizability. It also accounts for the influence of nearby home prices on each other. In general, the model has good predictive power.
To further improve the model, several options could be tried:
1) Obtain more data on accessibility to high-quality amenities, which can drive home prices up very quickly, to improve predictive accuracy for the highest home prices.
2) Consider using non-linear transformations of the independent variables, such as squared and log terms, to better capture their relationships with home prices.
3) Reduce the geographic scope of the spatial lag variable, that is, home sale prices within a certain buffer. In our final model the scope is a quarter-mile buffer; this could be smaller, since home prices can differ considerably from block to block.